Refactor CUDA graph API: decompose cuda_graph_scope into full_iteration impl, inference scope, and per-layer capture modules#4292
Conversation
f36afad to
0f07b30
Compare
0f07b30 to
113c46a
Compare
614115f to
9461d5a
Compare
|
/ok to test 73b71e7 |
|
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/25928397787 |
|
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/25928461458 |
|
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/25938098213 |
|
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/25938124861 |
|
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/25942161774 |
|
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/25947850162 |
|
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/25954504754 |
|
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/25957305634 |
|
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/25963582311 |
|
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/25967578667 |
…_impl API PR NVIDIA#3797 (A2A overlap for Megatron-FSDP) landed on main after this refactor branch diverged, and it added a partial-CUDA-graph guard in mcore_fsdp_adapter.py that iterates the legacy config.cuda_graph_scope list. After the merge, that field is normalized to None and the iteration raises TypeError: 'NoneType' object is not iterable, breaking every a2a_overlap unit test and the deepseek_proxy_fsdp_ep2_fsdp2_ep_overlap functional test. Replace the legacy iteration with the equivalent new-API assertion on config.cuda_graph_impl: per-layer CUDA graphs (local / transformer_engine) are still rejected; full_iteration and none remain allowed. Drop the now-unused CudaGraphScope import. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
/ok to test e08b7e2 |
|
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/26007684237 |
What does this PR do ?
Dev PR #4293
This PR decomposes the overloaded
cuda_graph_scopefield into three dedicated, semantically distinct concepts and cleans up naming throughout.Problem
The old API overloaded
--cuda-graph-scopewith three unrelated concerns:--cuda-graph-scope full_iteration→ full-iteration training graphs (a capture strategy, not a module)--cuda-graph-scope full_iteration_inference→ block-owned inference graphs (inference ownership, not a module)--cuda-graph-scope attn mlp→ per-layer capture regions (the actual intended use)CudaGraphScopemixed iteration-level control flow with per-layer module selection. The three concepts have nothing in common and cannot be meaningfully combined in one field.Solution
Four concrete changes:
full_iterationbecomes its owncuda_graph_implvalue.--cuda-graph-impl full_iterationreplaces--cuda-graph-impl local --cuda-graph-scope full_iteration.--inference-cuda-graph-scopeis a new dedicated field.InferenceCudaGraphScope(none/layer/block) replacesfull_iteration_inferenceincuda_graph_scope. The default for--cuda-graph-impl localislayer(preserving prior behaviour).CudaGraphScopeis renamed toCudaGraphModule;cuda_graph_scopetocuda_graph_modules.With the two non-module values removed, the enum and field names now accurately reflect their purpose: selecting which per-layer modules to capture.
Normalization is centralized in
cuda_graph_config.py.normalize_cuda_graph_modules,normalize_inference_cuda_graph_scope, andvalidate_deprecated_cuda_graph_modules_migration_inputsare shared betweenTransformerConfig.__post_init__andvalidate_args, removing duplication and making the migration logic testable in isolation.Backward compatibility
All deprecated inputs are still accepted and silently migrated at startup:
--enable-cuda-graph--cuda-graph-impl local--external-cuda-graph--cuda-graph-impl transformer_engine--cuda-graph-scope full_iteration--cuda-graph-impl full_iteration--cuda-graph-scope full_iteration_inference--cuda-graph-impl local --inference-cuda-graph-scope block--cuda-graph-scope attn mlp--cuda-graph-modules attn mlpfrom megatron.core.transformer.enums import CudaGraphScopeCudaGraphScopealias kept; useCudaGraphModuleTransformerConfig(cuda_graph_scope=...)cuda_graph_scopefield kept, deprecated; usecuda_graph_modulesConflicting combinations (e.g. passing both a deprecated value and the new flag with an incompatible value) are rejected with a clear assertion.
Contribution process
Pre-checks
Code review
Feel free to message or comment the @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!
All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.
Step 1: Mark PR as "Ready for Review"
.github/CODEOWNERS.Final Review might get declined if these requirements are not fulfilled.
Step 2: Final Review
For PRs that change
megatron/core, once all expert reviewers have approved, theFinal Reviewlabel is applied automatically and final reviewers are assigned.For PRs outside
megatron/core, this step is skipped.Step 3: Approved
Once all required reviewers have approved, the
Approvedlabel is applied automatically.Merge
Any member of mcore-engineers will be able to merge your PR.
For MRs into `dev` branch
The proposed review process for `dev` branch is under active discussion.MRs are mergable after one approval by either
eharper@nvidia.comorzijiey@nvidia.com.